Abstract

Background: Short oligonucleotides can be used as markers to tag and track DNA sequences. For example,
barcoding techniques (i.e., Multiplex Identifiers or Indexing) use short oligonucleotides to distinguish between reads
from different DNA samples pooled for high-throughput sequencing. A similar technique called molecule tagging
uses the same principles but is applied to individual DNA template molecules. Each template molecule is tagged
with a unique oligonucleotide prior to polymerase chain reaction. The resulting amplicon sequences can be traced
back to their original templates by their oligonucleotide tag. Consensus building from sequences sharing the same
tag enables inference of original template molecules, thereby reducing effects of sequencing error and polymerase
chain reaction bias. Several independent groups have developed similar protocols for molecule tagging; however,
user-friendly software for building consensus sequences from molecule tagged reads is not readily available or is
highly specific for a particular protocol.

Results: MT-Toolbox recognizes oligonucleotide tags in amplicons and infers the correct template sequence. On a 
set of molecule tagged test reads, MT-Toolbox generates sequences having on average 0.00047 errors per base. 
MT-Toolbox includes a graphical user interface, command line interface, and options for speed and accuracy 
maximization. It can be run in serial on a standard personal computer or in parallel on a Load Sharing Facility based 
cluster system. An optional plugin provides features for common 16S metagenome profiling analysis such as 
chimera filtering, building operational taxonomic units, contaminant removal, and taxonomy assignments. 

Conclusions: MT-Toolbox provides an accessible, user-friendly environment for analysis of molecule tagged reads 
thereby reducing technical errors and polymerase chain reaction bias. These improvements reduce noise and allow 
for greater precision in single amplicon sequencing experiments.

Background

High-throughput sequencing has revolutionized biological
science and biomedical research. However, erroneous base
calls reduce the information value of each sequence, and
polymerase chain reaction (PCR) bias leads to inaccurate
quantification of sequences. To address these limitations,
several methods have been developed in which randomly
generated oligonucleotides are used as a molecule tag (MT).
Molecule tagging should not be confused with barcoding
(i.e., Multiplex Identifiers or Indexing), where short
oligonucleotides are used to tag individual samples, which
are then pooled and simultaneously sequenced. The resulting
reads are then informatically sorted by the sample barcode.
Molecule tagging is a similar idea in which unique tags are
attached to individual DNA template molecules within a sample
prior to exponential PCR amplification (Additional file 1:
Figure S1.A). After PCR and sequencing, reads sharing the
same MT likely originated from the same template molecule,
meaning that discrepancies among these reads can be
attributed to technical error. Forming consensus sequences
(ConSeqs) from reads with the same MT corrects these errors.
Additionally, any preferential PCR amplification biases are
mitigated because ConSeqs represent the original population
of templates [1-3].

Molecule tagging is useful for a variety of applications.
For instance, Kinde et al. [4] used molecule tagging to test
polymerase fidelity, accuracy of in vitro synthesized
oligonucleotides, and prevalence of mutations in nuclear and
mitochondrial genomes of normal cells. Jabara et al. [5] used
molecule tagging to detect and quantify single nucleotide
polymorphisms (SNPs) in the HIV-1 protease gene in complex
viral populations. Kivioja et al. [1] showed how molecule
tagging improves quantification of mRNA sequencing
experiments. Faith et al. [6] used a molecule tagging method
called Low-Error Amplicon Sequencing (LEA-Seq) for
metagenomic 16S gut profiling and found a substantial
reduction in the observed microbial community complexity due
to the elimination of spurious sequences. Lundberg et al. [7]
saw a similar reduction in 16S microbial complexity when
profiling microbially diverse bulk soil samples. In each of
these studies, molecule tagging allowed greater confidence in
the amplicon sequences and their quantification.

Despite extensive efforts developing and using these
error-reducing protocols, software for building ConSeqs in
the previously cited projects [4-6] is not readily available
or is highly specific for a particular application
(e.g., [5]). For example, LEA-Seq scripts can only be run on
a small number of 16S amplicons sequenced using paired-end
108 bp Illumina reads with a single 12-20 bp molecule tag.
This specificity makes LEA-Seq scripts less practical for
most amplicon experiments that could benefit from molecule
tagging. Consequently, we developed MT-Toolbox (Molecule Tag
Toolbox), a flexible and user-friendly software package to
generate ConSeqs from molecule tagged reads produced from
several different MT protocols.

Figure 1 MT-Toolbox overview. Single-end or paired-end (overlapping or non-overlapping) reads can be input into MT-Toolbox. Overlapping
paired-end reads are merged, after which all reads, regardless of their type, are categorized by their MT. Next, a square alignment matrix is created
for each MT category using either an MSA algorithm (e.g., ClustalW, MUSCLE) or by read stacking. From these matrices, ConSeqs are built and
quality control measures remove low-quality ConSeqs. When using the MT-MT-Toolbox plugin, ConSeqs are subjected to traditional 16S profiling
analyses including OTU clustering, chimera filtering, contaminant filtering, and assigning taxonomy.

Implementation

The primary purpose of MT-Toolbox is to categorize reads by
MT and build ConSeqs (Figure 1). MT-Toolbox can categorize
and correct single-end (SE), overlapping paired-end (PE), and
non-overlapping PE reads. With overlapping PE reads, a
preprocessing step runs FLASH [8] to merge corresponding PE
reads into a single sequence. Regular expressions, a common
pattern matching technique, are used to identify the expected
regions (e.g., MT, primer, amplicon) of each read (Additional
file 1: Figure S1.B-D). Reads matching the regular expression
are then categorized by their MT.
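
The categorization step can be illustrated with a minimal Python sketch. It is not MT-Toolbox's own code (which is written in Perl), and the read layout, 12 bp tag length, primer sequence, and the names `FWD_PRIMER`, `READ_PATTERN`, and `categorize_by_mt` are all assumptions made for illustration; real patterns depend on the protocol (Additional file 1: Figure S1.B-D).

```python
import re
from collections import defaultdict

# Hypothetical read layout: <12 random tag bases><forward primer><amplicon>.
# Tag length and primer sequence are illustrative assumptions only.
FWD_PRIMER = "GTGCCAGC"
READ_PATTERN = re.compile(r"^([ACGT]{12})" + FWD_PRIMER + r"([ACGTN]+)$")


def categorize_by_mt(reads):
    """Group amplicon sequences by molecule tag; return (categories, unmatched)."""
    categories = defaultdict(list)
    unmatched = []
    for read in reads:
        match = READ_PATTERN.match(read)
        if match:
            tag, amplicon = match.groups()
            categories[tag].append(amplicon)
        else:
            unmatched.append(read)  # read is not "categorizable"
    return dict(categories), unmatched


reads = [
    "AAAACCCCGGGG" + FWD_PRIMER + "TTGACA",
    "AAAACCCCGGGG" + FWD_PRIMER + "TTGACA",
    "TTTTGGGGCCCC" + FWD_PRIMER + "TTGACA",
    "AAAACCCCGGGG" + "NNNNNNNN" + "TTGACA",  # primer region does not match
]
categories, unmatched = categorize_by_mt(reads)
# Number of reads per tag, i.e. the MT depth of each category
depths = {tag: len(seqs) for tag, seqs in categories.items()}
```

In this sketch, reads that fail the pattern are set aside rather than corrected, so only reads with the expected structure contribute to an MT category.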

The first step in identifying and correcting errors among
reads having the same MT is building a square alignment
matrix, M (Additional file 1: Note S.1). In this matrix, each
row, r, represents a read and each column, c, contains the
single base from read r at position c. The number of rows
(i.e., the number of reads) in M is referred to as the MT
depth and is an important parameter in evaluating the
accuracy of final ConSeqs. Multiple sequence alignment (MSA)
programs such as ClustalW [9] or MUSCLE [10] can be used to
generate such a matrix. However, computational overhead from
operations like file input/output associated with these
programs requires a substantial amount of time (Additional
file 1: Figure S2). Alternatively, M can be created without
using an MSA program by simply stacking reads. Because reads
in an MT category are likely to originate from the same
template molecule, they are likely to have uniform lengths
(Additional file 1: Figure S3). Furthermore, Illumina
sequencing rarely introduces insertions or deletions into
reads (Additional file 1: Figure S4). In the rare case where
reads in a single MT category differ in length, reads are
clustered by length and only reads from the largest cluster
are used to build the ConSeq. If several clusters tie for the
largest size, one of them is arbitrarily chosen to build the
ConSeq. Comparisons of ConSeqs generated by ClustalW, MUSCLE,
and the read stacking method show that ConSeqs derived from
stacked reads are only slightly less accurate (Additional
file 1: Figure S5), while read stacking reduces runtime by
~54%. While options for using either ClustalW or MUSCLE are
available in MT-Toolbox, the default is to stack reads.
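
The read stacking alternative can be sketched in a few lines of Python. This is a simplified illustration under the assumptions stated in the text (no indels, reads grouped by length), not the MT-Toolbox implementation; the function name `stack_reads` is hypothetical.

```python
from collections import Counter


def stack_reads(reads):
    """Build an alignment matrix by stacking the reads of the most common length."""
    length_counts = Counter(len(read) for read in reads)
    # most_common(1) returns the first-seen length when counts tie,
    # which serves here as the arbitrary choice among equal clusters.
    best_length, _ = length_counts.most_common(1)[0]
    return [list(read) for read in reads if len(read) == best_length]


# Three reads of length 4 and one of length 3: the shorter read is dropped.
matrix = stack_reads(["ACGT", "ACGA", "ACG", "ACGT"])
depth = len(matrix)  # number of rows, i.e. the depth of the stacked matrix
```

Because no alignment is computed, the cost is a single pass over the reads, which is consistent with the runtime saving reported above.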

Figure 2 Read and MT counts per sample. Here reads are defined as the raw reads that can be categorized ("categorizable"); in
other words, reads that match the expected regular expression pattern for merged reads (Additional file 1: Figure S1.C). MT-Toolbox assigns each
categorizable read to an MT category. MT counts are the number of MT categories (i.e., the number of originally tagged DNA templates). The sum
of reads across all MT categories equals the number of categorizable reads.

From M a consensus sequence can be built by choosing the
mode base in each column (Additional file 1: Figure S6). The
quality score of the consensus base is set to the mean of the
original quality values of the mode base. Ties are resolved
by choosing the base with the highest average quality score.
If a tie cannot be resolved using quality scores, an IUPAC
encoding is used as the consensus base. Using quality score
information provides a major advantage because ConSeqs can be
generated from MTs represented by only two reads, thereby
keeping a larger proportion of reads. This is especially
important for samples with high amplicon population diversity
because it captures a larger fraction of the population. This
is an improvement over LEA-Seq, which cannot build ConSeqs
from MTs having a depth of two where the reads are not
identical. Furthermore, other ConSeq building software
(e.g., [5]) uses only sequence information to build consensus
sequences and thus can only generate ConSeqs from MTs having
a depth greater than two reads.
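
The consensus rules described above (mode base, quality-based tie breaking, IUPAC fallback) can be written as a short Python sketch. It is an illustration rather than MT-Toolbox's Perl code: the function name `build_conseq` is hypothetical, quality values are assumed to be numeric Phred scores, and the quality assigned to an IUPAC base (the shared mean of the tied bases) is an assumption, since the text does not specify it.

```python
from collections import defaultdict

# IUPAC ambiguity codes keyed by the sorted set of tied bases
IUPAC = {"AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
         "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N"}


def build_conseq(reads, quals):
    """reads: equal-length strings; quals: matching lists of Phred scores."""
    seq, seq_quals = [], []
    for col in range(len(reads[0])):
        scores = defaultdict(list)  # base -> quality values seen in this column
        for read, qual in zip(reads, quals):
            scores[read[col]].append(qual[col])
        top = max(len(v) for v in scores.values())
        # Mean quality of every base that ties for the highest count
        means = {b: sum(v) / len(v) for b, v in scores.items() if len(v) == top}
        best_q = max(means.values())
        winners = sorted(b for b, q in means.items() if q == best_q)
        if len(winners) == 1:
            seq.append(winners[0])
        else:
            # Tie not resolved by quality: fall back to an ambiguity code
            seq.append(IUPAC.get("".join(winners), "N"))
        seq_quals.append(best_q)  # assumption for the IUPAC case
    return "".join(seq), seq_quals


# Depth-2 category with one disagreement, resolved by quality (35 vs 20)
resolved = build_conseq(["ACGT", "ACTT"], [[30, 30, 35, 30], [30, 30, 20, 30]])
# Same reads with equal qualities: the disagreement becomes K (G or T)
ambiguous = build_conseq(["ACGT", "ACTT"], [[30] * 4, [30] * 4])
```

The first call shows why a depth of two is sufficient when quality scores are available: the disagreeing column is settled by the higher-quality base instead of being discarded.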

The primary output file contains ConSeqs and corresponding
quality scores in FASTQ format. A second FASTQ file contains
single read categories (SRCs; MTs with only one raw read)
that can optionally be included in downstream analysis
(Additional file 1: Note S.2). However, SRCs retain all
technical errors associated with sequencing and PCR because
no consensus sequence can be generated from a single read.
Quality control parameters (Additional file 1: Note S.3)
allow filtering of low quality ConSeqs and SRCs, ConSeqs with
low depth, and ConSeqs where a single MT tags two different
templates by chance (i.e., the 'birthday paradox') [11,12]
(Additional file 1: Note S.4, Figures S7 and S8).
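
To give a sense of scale for the 'birthday paradox', the following Python sketch estimates the chance that at least two templates in a sample receive the same tag. It assumes every tag of a given length is drawn uniformly at random, which real oligonucleotide pools only approximate, and the function name `collision_probability` is hypothetical. It is not the c-score used by MT-Toolbox, which is described in Additional file 1: Note S.4.

```python
import math


def collision_probability(n_templates, tag_length):
    """P(at least two templates share a tag), assuming uniform random tags."""
    n_tags = 4 ** tag_length  # number of distinct tags of this length
    if n_templates > n_tags:
        return 1.0  # more templates than tags: a shared tag is certain
    # log of P(all tags distinct) = sum over i of log(1 - i / n_tags)
    log_p_unique = sum(math.log1p(-i / n_tags) for i in range(n_templates))
    return 1.0 - math.exp(log_p_unique)


# Toy check: 2 templates, 4 possible tags (length 1) -> 1 - 3/4
p_toy = collision_probability(2, 1)
# The probability grows with the number of tagged templates
p_small = collision_probability(10_000, 12)
p_large = collision_probability(100_000, 12)
```

Under these assumptions, longer tags or fewer tagged templates lower the chance of a shared tag, which is why filtering for such ConSeqs matters most in deeply sampled experiments.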

MT-Toolbox also includes the following features: 1) jobs can
be started via a graphical user interface (GUI) or command
line interface (Additional file 1: Figure S9); 2) an
additional plugin, the MeTagenomics plugin (MT-MT-Toolbox;
Additional file 2), provides features for 16S microbial
profiling, namely building operational taxonomic units
(OTUs), assigning OTU taxonomy, and removing contaminant OTUs
(Figure 1; Additional file 1: Note S.5); 3) the BioUtils
library (Additional file 1: Note S.6, Figure S10; Additional
file 3), digital normalization [13] parameters (Additional
file 1: Note S.7), and optional Load Sharing Facility (LSF)
based cluster parallelization (Additional file 1: Note S.8)
reduce runtime and memory requirements; and 4) an MT-Toolbox
website provides descriptions, tutorials, installation
instructions, updates, and other important documentation
[14].

MT-Toolbox is implemented as a suite of object-oriented Perl
modules and scripts (Additional file 4). It has been
successfully tested on Perl versions 5.8.8, 5.8.9, and
5.12.3. Several external Perl modules are required; these can
be downloaded and installed via a simple build command before
building and installing MT-Toolbox. The GUI was built using
the Perl/Tk library and requires an X Window System.
MT-Toolbox also uses gnuplot 4.4 for generating simple
summary graphs. The optional MT-MT-Toolbox plugin allows for
standard 16S microbial profiling analysis. MT-MT-Toolbox
requires USEARCH v7.0.1090 [15] or greater for OTU clustering
and chimera filtering, the RDP Classifier [16] as implemented
in QIIME [17] for OTU taxonomy classification, and BLAST+
2.2.25 [18] for contaminant sequence removal.

Figure 3 MT depth histograms for each sample. The number of reads in each MT category influences the accuracy of the resulting ConSeqs.
MTs with higher depth are likely to generate more accurate ConSeqs. Diluting samples helps generate more MTs with higher depth at the cost
of reducing the amplicon diversity.

Results 

To show the utility of ConSeqs generated by MT-Toolbox, we
used data from [7] derived from a clonal plasmid containing a
known 16S gene. From this single clonal plasmid, separate DNA
samples were created by performing two replicate dilutions of
1x, 50x, or 100x, for a total of six samples. Each sample was
molecule tagged and PCR amplified. It is important to note
that each sample should contain just one "real" amplicon
matching the original 16S amplicon in the clonal plasmid.
Samples were barcoded, pooled, and sequenced on an Illumina
MiSeq platform using standard 2 x 250 bp protocols. After
demultiplexing samples, ConSeqs were generated using
MT-Toolbox (Figure 2).

Diluted samples result in a greater number of MTs having high
depth (Figure 3); the diversity of the original amplicon
population, however, is reduced. In theory, molecule tags
having high depth should generate the most accurate ConSeqs
by overcoming the effects of sequencing error. This creates a
trade-off between creating highly accurate ConSeqs and
capturing the diversity of the amplicon population.
Low-complexity samples benefit from dilution because a large
number of accurate ConSeqs can be created without sacrificing
information about the diversity of the amplicon population.
Alternatively, for high-complexity samples like soil
microbial communities, it may be better to sacrifice ConSeq
accuracy to observe a larger portion of the amplicon
population.

Figure 4 ConSeqs error profile. MT-Toolbox derived ConSeqs have fewer EPB than raw reads, and errors within ConSeqs are reduced as MT
depth increases. MT-Toolbox ConSeqs generated from overlapping PE reads are the most accurate even at low MT depths. ConSeqs built from
forward and reverse reads are slightly less accurate than overlapping PE ConSeqs. Raw reads of any type are the least accurate. MT-Toolbox
ConSeqs generated from PE reads at depths <10 (88% of the data) are more accurate than those generated by LEA-Seq. EPB were calculated by
averaging individual EPB of ConSeqs having the same depth. Error bars represent standard error and grow in length as depth increases due to
fewer MTs having high depths (Figure 3).

The most accurate ConSeqs were generated from merged PE
reads. Average errors per base (EPB) for ConSeqs built from
ClustalW or MUSCLE alignments of merged PE reads were
~0.00047. Without using an MSA (i.e., the read stacking
method), ConSeqs had ~0.00112 EPB. Removing ConSeqs with a
c-score <35 reduced the EPB to ~0.00089 (Additional file 1:
Note S.4, Figures S7 and S8). In general, ConSeqs derived
from any type of molecule tagged read were more accurate than
any type of raw read (Figure 4; Additional file 1: Note S.9).
Also, as depth of coverage for each MT increased, consensus
sequence EPB decreased. However, for depths >15, EPB
increases slightly. This trend is primarily driven by a
single sample (Additional file 1: Figure S11). Why this
particular sample has higher mean EPB is unclear; however, it
is unlikely to be caused by sample contamination or
sequencing error (Additional file 1: Figure S12). In general,
this outlying sample appears more error prone even at depths
where other samples have very few errors (Additional file 1:
Figure S11). Two examples of errors in high depth ConSeqs
from this sample suggest that nucleotide misincorporation
during early PCR cycles contributes to increased EPB
(Additional file 1: Figure S12). Additionally, this sample
has relatively fewer ConSeqs at these high depths (Figure 3),
indicating that outlier ConSeqs may be inflating the mean
EPB. In any case, nearly all ConSeqs in this sample still
have fewer EPB than raw reads.
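
The EPB summary by depth can be sketched as follows. This Python example assumes each ConSeq is compared position by position against a known reference of the same length (reasonable for the clonal plasmid, but an assumption nonetheless, since the comparison procedure is detailed only in Additional file 1: Note S.9); the function names `errors_per_base` and `epb_by_depth` are hypothetical.

```python
import math
from collections import defaultdict


def errors_per_base(conseq, reference):
    """Fraction of mismatched positions; assumes equal lengths and no indels."""
    if len(conseq) != len(reference):
        raise ValueError("sequence and reference must have the same length")
    mismatches = sum(a != b for a, b in zip(conseq, reference))
    return mismatches / len(reference)


def epb_by_depth(records, reference):
    """records: iterable of (depth, conseq). Returns {depth: (mean, stderr)}."""
    grouped = defaultdict(list)
    for depth, conseq in records:
        grouped[depth].append(errors_per_base(conseq, reference))
    summary = {}
    for depth, values in grouped.items():
        mean = sum(values) / len(values)
        if len(values) > 1:
            # Sample variance, then standard error of the mean
            var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
            stderr = math.sqrt(var / len(values))
        else:
            stderr = 0.0  # a single ConSeq gives no spread estimate
        summary[depth] = (mean, stderr)
    return summary


summary = epb_by_depth([(2, "ACGT"), (2, "ACGA"), (3, "ACGT")], "ACGT")
```

With few ConSeqs at a given depth, a single erroneous ConSeq shifts the mean noticeably, which matches the observation that outliers at high depth may inflate the mean EPB.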

In general, MT-Toolbox outperforms LEA-Seq in terms of
accuracy (Figure 4) and data retention. For MT depths of 2
(21% of the data), LEA-Seq failed to generate ConSeqs because
it is unable to resolve differences between only two reads.
In contrast, MT-Toolbox uses read quality scores to resolve
such differences, thereby retaining MTs of depth 2. For MT
depths between 3 and 10 (67% of the data), mean EPB of
MT-Toolbox ConSeqs generated from merged PE reads is lower
than that of LEA-Seq ConSeqs. For the remaining MT depths
(12% of the data), MT-Toolbox ConSeqs were either on par with
or slightly less accurate than LEA-Seq ConSeqs. Furthermore,
because LEA-Seq was implemented to recognize only specific
amplicons, a substantial number of changes to the source code
were required to run LEA-Seq on these reads. This highlights
the utility of MT-Toolbox, where users can easily adjust
parameters to build ConSeqs from virtually any amplicon or
sequencing technology.

Conclusions 

Molecule tagging is a practical technique applicable to a
variety of amplicon sequencing experiments; however,
generalizable and easily accessible software for processing
custom MT data is not readily available. MT-Toolbox fills
this niche by enabling efficient processing of MT data
generated from a variety of protocols (Additional file 1:
Note S.10). By forming ConSeqs, MT-Toolbox reduces technical
errors and biases associated with PCR and sequencing, thereby
enabling precise measurements of DNA template abundance in
mixed amplicon samples.

Availability and requirements 

• Project Name: MT-Toolbox. 

• Project Home Page: https://sites.google.com/site/moleculetagtoolbox/.

• Operating System(s): Unix-based system.

• Programming Language: Perl. 



• Other Requirements: Perl 5.8.8 or higher, select 
Perl libraries, ClustalW 2.1, MUSCLE 3.8.31, 
FLASH 1.2.5, gnuplot 4.4, an X Window System, 
select Unix system commands. 

• License: FreeBSD. 

• Any restrictions to use by non-academics: None.

Additional files



Additional file 1: Supplementary Information. Note S.1. Building the
Alignment Matrix. Note S.2. Single Read Categories. Note S.3. Optimizing
ConSeq Accuracy. Note S.4. Filtering 'birthday paradox' ConSeqs Using the
c-score. Note S.5. MT-MT-Toolbox (MeTagenomics Edition). Note S.6.
BioUtils. Note S.7. Digital Normalization. Note S.8. Cluster Parallelization.
Note S.9. Clonal Plasmid Accuracy. Note S.10. Protocols Compatible with
MT-Toolbox. Figure S1. The implementation of molecular tags used in
Lundberg et al. 2013. Figure S2. Runtime in CPU seconds of ClustalW and
MUSCLE for MT categories of different depths. Figure S3. The length
distribution of reads is very narrow. Figure S4. The number and types of
errors seen in ConSeqs generated without using an MSA algorithm
(i.e., using stacked reads) for all clonal plasmid samples. Figure S5.
Accuracy of ConSeqs generated from ClustalW, MUSCLE, or stacked
reads (i.e., no MSA) alignments. Figure S6. A general schematic of
how five overlapping PE molecule tagged reads are used to generate
highly accurate consensus sequences. Figure S7. c-score distributions for
ConSeqs generated using different methods. Figure S8. The correlation
between c-score and read errors. Figure S9. A screen shot of the GUI for
the basic version of MT-Toolbox. Figure S10. For FASTQ file IO, BioUtils is
significantly faster and scales better than BioPerl. Figure S11. Errors per base
profile for individual samples for merged PE reads where ConSeqs are built
without using an MSA. Figure S12. Higher errors per base in sample 100x B
are unlikely to be caused by contamination or sequencing errors.

Additional file 2: Source code for BioUtils-v1.0.9.

Additional file 3: Source code for MT-MT-Toolbox-v4.1.0.

Additional file 4: Source code for MT-Toolbox-v4.1.0.